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Abstract 

We consider analysis of Genetic Analysis Workshop 18 data, which involves multiple longitudinal traits and dense 
genome-wide single-nucleotide polymorphism (SNP) markers. We use a multivariate linear mixed model to account 
for the covariance of random effects and multivariate residuals. We divide the SNPs into groups according to the 
genes they belong to and score them using weighted sum statistics. We propose a penalized approach for genetic 
variant selection at the gene level. The overall modeling and penalized selection method is referred to as the 
penalized multivariate linear mixed model. Cross-validation is used for tuning parameter selection. A resampling 
approach is adopted to evaluate the relative stability of the identified genes. Application to the Genetic Analysis 
Workshop 18 data shows that the proposed approach can effectively select markers associated with phenotypes at 
gene level. 



Background 

The Genetic Analysis Workshop 18 (GAW18) data con- 
sists of multiple longitudinal traits and dense genome- 
wide single-nucleotide polymorphism (SNP) markers. A 
commonly used approach for identifying markers asso- 
ciated with traits is to conduct single-variant analysis 
and then adjust for multiple comparisons on each trait. 
However, for complex polygenic traits, single-variant 
analysis methods may not be appropriate as they fail to 
take into account the accumulated and/or joint effects 
of multiple genetic variants on the traits. In addition, 
analyzing each trait separately does not take into 
account the correlation among traits, and thus can be 
ineffective. To overcome these limitations, we developed 
a joint analysis approach referred to as the penalized 
multivariate linear mixed model (PMLMM). This 
approach takes into account covariance of both random 
effects and residuals and uses a group minimax concave 
penalty (MCP) approach [1] for variant selection at the 
gene level. A resampling approach is adopted to evaluate 
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the relative stability of the identified genes. Our analysis 
of the GAW18 data indicates that the proposed 
approach can effectively select markers associated with 
multiple traits at the gene level. 

Methods 

Consider a genetic association study with longitudinal mea- 
surements on N subjects, p genetic variants, and q environ- 
mental exposure covariates. Here a genetic variant can be a 
single SNP marker or a score representing a group of 
SNPs. For subject i, suppose that there are ft; longitudinal 
measurements on m traits. Let Y{ be the n; x m trait matrix 
for subject i. Let Y be the n x m trait matrix for all the N 

subjects, where n = n;- The transpose of Y is 

t—ii=i 

Yi = (Y[, ■ ■ ■, Y' H ). Let Xi be the n, x p matrix consisting of 
the genetic variant scores of subject i. Let Z; be the n, x q 
covariate matrix. We center all the measurements to have 
sample means equal to zero. When m = 1, this setting sim- 
plifies to that in Schelldorfer et al [2]. 
Consider the multivariate linear mixed model 



Y, ■= X,B + Z;C; + E„ i=l,...,N, 



(1) 



where Bisapxm matrix representing the effects of p 
genetic variants on m traits, and C, is a q x m matrix 



o 
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representing the subject specific effects of the covariates 
Zj for the i th subject. We treat Q as random effects. 

Assume that (a) Ei ~ MN «ixm( 0 ' that is> E . is 

l 

row-independent with column covariance matrix Ei, 
and each £; is independent for i=l,...,N; (b) 

Q ~ MN, xffl (0, where E 2 is the column covar- 

2 

iance matrix and D is the row covariance matrix, and 
each Q is independent for i = 1, ...,N; (c) each £, and Q 
is independent; and (d) Ei = £2 = E. 

Then Z;C; + E; ~ MN„ iXm (0, Z;DZ- + J nj ) and 
Y ~ MN nxffl (XB, ^\ V) where V = Diag(V u V N ) and 
Vi = Z{DZ\ + I nj , where I ni is an n* x identity matrix. A 
more detailed description of this model can be found in 
Liu et al [3]. 

From Dawid [4], the negative log-likelihood function is: 

-t(B, V, £) - constant + f log I J> | log |V| + itr£> - XBJ/V" 1 (y - XB)) (2) 

Hastie et al. [5] suggest using E = /y/n for estimating 
We estimate £> by using the estimates from m uni- 
variate linear mixed models and subsequently get the 
estimate y of V as y ; = Z;DZj + I nj . Given ^ and y, we 

can transform the negative likelihood function into a 

weighted least squares criterion for estimating B, which 

- -l 

is tr(^ (y - XB)iV~ l (y - XB)> For variant selection, 
we adopt the group MCP approach [6]. The overall 
penalized objective function is 

Q(B)=tr(£ \y- XByv-\y- XB)) + J2 P hl P(\\Bj\h;KY), (3) 

where Bj is the /* row of B and 

r\t\ 

p{t; X,y) = X I (1 - x/{yX)) + dx is the MCP with tun- 
Jo 

ing parameter X and regularization parameter Y [7]. 

For computation, we use a group coordinate descent 
algorithm [1]. The group MCP involves a regularization 
parameter and a tuning parameter. Generally speaking, 
smaller values of Y are better at retaining the unbiased- 
ness of the MCP penalty for large coefficients, but they 
have the risk of creating objective functions that have 
problems with nonconvexity [8], are difficult to opti- 
mize, and yield solutions that are discontinuous with 
respect to X. Simulation studies by Breheny and Huang 
[8] suggest that y = 6 is a reasonable choice. Therefore, 
we fix it to be 6 in our analyses. We search for the opti- 
mal value of X using 5-fold cross-validation. 

Results 

The GAW18 data set consists of dense genome-wide 
markers with longitudinal measurements on systolic and 



diastolic blood pressure (SBP and DBP) and other covari- 
ates. Other measurements include gender, age, year of 
examination, use of antihypertensive medications, and 
tobacco smoking at up to 4 time points. In this study, we 
analyze the 157 unrelated individuals using SBP and DBP 
as traits and other medical and demographic covariates as 
random effects. Gene annotations for SNP data are 
obtained from http://www.scandb.org. SNPs in each gene 
are scored using weighted sum statistics to generate gene- 
level measurements [9]. After quality control, we have the 
genetic scores of 10,400 genes for further analysis. SBP, 
DBP, and genetic scores are standardized to have zero 
means and unit variances. This procedure removes the 
estimation of intercepts and makes the genes comparable. 

We apply the proposed PMLMM to identify genetic var- 
iants that are associated with both SBP and DBP at the 
gene level. As a benchmark, we also analyze each trait 
separately using a penalized linear mixed model (PLMM) 
approach. Table 1 shows the genes identified using 
PMLMM. Table 2 summarizes the overlaps of genes 
selected using the different approaches. Although there is 
overlap, PMLMM and PLMM identify significantly differ- 
ent sets of genes. We evaluate the relative stability of iden- 
tification of each gene using a resampling approach and 
calculate the observed occurrence index (OOI) [10]. A lar- 
ger value of OOI indicates that the corresponding identi- 
fied gene is more stably identified. Table 1 also shows OOI 
results. The identified genes have reasonably high OOIs. 

Discussion 

In this study, we analyze the GAW18 data and develop 
a PMLMM approach. A multivariate linear mixed model 
is used to model variance components among traits and 
longitudinal measurements. A penalization approach is 
adopted for variant selection. In the estimation proce- 
dure, it can be considered heuristic to use and y as 
proposed. Assumptions (a) to (c) are standard in mixed 
models, but the assumption that Ei = E 2 may be restric- 
tive. Because our study is to identify multitrait-asso- 
ciated markers at the gene level, the restriction on 
variance components does not affect the selection result 
significantly. We are currently developing a similar 
approach to update variance components with more 
relaxed assumptions on Ei and E 2 . An iterative algo- 
rithm can be implemented to solve for B, and V. In 
variant selection, our method is designed to search for 
genes associated with all the traits considered. When 
different sets of genetic variants are suspected to be 
associated with different phenotypes, the sparse group 
penalization approach [11] can be applied. 

Conclusions 

We have presented a penalized multivariate linear mixed 
model (PMLMM) for detecting pleiotropic genetic 
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Table 1 Genes identified by PMLMM: estimates for SBP and DBP, and 001 



Gene 


SBP 


DBP 


001 


Gene 


SBP 


DBP 


001 


MMEL1 


0.002 


-0.002 


0.333 


FMEM41B 


0.027 


0.033 


0.403 


CD 5 2 


0.085 


0.060 


0.697 


ARNFL 


0.024 


0.006 


0.247 


DPH2 


0.071 


-0.032 


0.323 


SPTY2D 1 


0.025 


-0.007 


0.507 


C8A 


0.018 


0.032 


0.563 


CH5F1 


-0.008 


-0.031 


0.540 


DNAJB4 


-0.028 


-0.022 


0.333 


MRE11A 


-0.042 


-0.007 


0.623 


HS2ST1 


0.002 


0.006 


0.307 


EN0X1 


-0.068 


-0.032 


0.647 


PR0K1 


0.006 


0.010 


0.373 


LOC100132760 


-0.041 


0.048 


0.693 


THBS3 


-0.004 


0.001 


0.337 


SPRY2 


-0.004 


-0.005 


0.297 


C1orf182 


2E-04 


0.045 


0.490 


GABRG3 


-0.027 


0.006 


0.573 


TGFBR2 


-0.033 


-0.030 


0.627 


THBS1 


0.023 


0.028 


0.353 


LOC100129194 


0.005 


0.01 1 


0.217 


CSPG4 


0.098 


-0.012 


0.880 


LM0D3 


-0.013 


-0.028 


0.493 


C15orf27 


0.047 


-0.014 


0.490 


LOC653712 


0.017 


0.001 


0.450 


LOCI 001 28570 


0.026 


0.019 


0.283 


LAMP3 


0.034 


0.008 


0.627 


H0MER2 


0.024 


0.013 


0.250 


EIF2B5 


-0.014 


-0.014 


0.41 7 


ADAMFS17 


0.002 


0.002 


0.270 


EHHADH 


0.003 


-0.052 


0.677 


SLC16A11 


0.015 


-0.004 


0.170 


SFRS12 


0.005 


0.007 


0.290 


ALDH3A 1 


0.002 


0.021 


0.657 


C5orf32 


0.019 


-0.029 


0.560 


FU44815 


0.019 


-0.005 


0.210 


ZNF346 


0.001 


-0.024 


0.553 


TANC2 


0.003 


-0.001 


0.187 


LOC100128901 


-0.069 


0.006 


0.627 


PDE6G 


-0.056 


-0.012 


0.377 


OGDH 


0.123 


0.038 


0.777 


C19orf6 


0.013 


0.048 


0.577 


NSUN5 


0.035 


-0.014 


0.660 


FMEM146 


-0.001 


-0.063 


0.550 


PPP1R3A 


-0.034 


-0.041 


0.453 


SFXW 


4E-04 


0.015 


0.333 


MEST 


-2E-04 


-3E-04 


0.197 


RLN3 


0.024 


-0.034 


0.603 


N0M1 


0.029 


-0.001 


0.630 


CYP4F1 1 


0.018 


0.005 


0.127 


FU41200 


0.013 


0.005 


0.333 


LOC728326 


0.048 


-0.040 


0.547 


LRRC19 


-0.015 


-0.040 


0.480 


ZNF585A 


0.028 


0.082 


0.720 


CCIN 


0.006 


-0.004 


0.437 


SUPF5H 


0.020 


-0.014 


0.233 


LOCW0130911 


0.004 


0.024 


0.467 


FU 10490 


0.004 


2E-04 


0.327 


PTCH1 


0.004 


-1E-04 


0.300 


ZNF331 


0.027 


0.003 


0.330 


DFNB3 / 


0.057 


-0.008 


0.710 


BACE2 


-0.116 


-0.074 


0.940 


OR52D1 


0.023 


-0.017 


0.490 


KRTAPW-12 


0.002 


0.003 


0.233 
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Table 2 Overlap of selected genes between PMLMM and 
PLMM 





PMLMM 


PLMM* 


PLMMt 


PMLMM 


64 


24 


16 


PLMM 1 




40 


0 


PLMM 2 






29 



•PLMM on SBP. 
tPLMM on DBP. 



associations among multiple traits in the presence of 
pedigree structures. The proposed approach combines 
the advantages of mixed models that allow for elegant 
correction for pedigree-based family data and integrative 
analysis of multiple traits. Compared with PLMM which 
considers one trait at a time, the proposed PMLMM 
can achieve better performance when the pleiotropic 
effect is appropriately modeled. 
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